This is a capstone project which is a part of Google Data Analytics Professional Certificate Course. In this project, information about the rides made by the customers of Cyclistic bike share company for the year 2023 is analyzed.
As a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago, the team needs information on maximizing the number of annual memberships for the future success of the organization. Therefore, marketing analyst team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.
This project uses the following data analysis approach:
Ask : Understanding the scenario and problem by identifying the business task and key stakeholders.
Prepare : Gathering the required data and check for its credibility and organize as per the requirement.
Process : Checking the data for any errors or inconsistencies and cleaning them to make the data more reliable. Transform the data to make it effective when analysis is done.
Analysis : Aggregate, organize and format your data so it’s useful and accessible.Perform calculations and identify trends and relationships within data.
Share : Create effective data visualizations to present your findings. Ensure your work is accessible.
Act : Prepare the deliverable including the top recommendations based on the analysis.
Cyclistic’s historical trip data to analyze and identify trends can be downloaded for previous 12 months from here link. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license link.
In order to understand the business task SMART framework is used i.e, Specific, Measurable, Action-oriented, Relevant, Time-bound.
For the future success of the organization, The Director of Marketing team is planning to increase the number of annual members of the Cyclistic bike share program. The business task is to understand how casual riders and annual members use Cyclistic bikes differently so that new marketing strategy can be created to convert casual riders into annual members. Inorder to approve the new strategy and make decisions, the team needs a thorough analysis to get important data insights and recommendations.
The key stakeholders are The Marketing Director, Cyclistic Executive Team and Cyclistic marketing Analytics Team.
The Cyclistic dataset for the year 2023 is downloaded and it includes 12 seperate csv files with ride details for each month. There are 13 columns, including the ride IDs, bike type, start and end timestamp, start and end station name, location and rider type.
Using the ROCCC framework, the data is checked for its credibility as follows:
The dataset is saved as excel file with .csv extension. Inorder to make it compatible to perform multiple functions, they are saved as xlxs files and made a copy to maintain the original file as it is for future reference.
The dataset is saved as excel file with .csv extension. Inorder to make it compatible to perform multiple functions, they are saved as xlxs files and made a copy to maintain the original file as it is for future reference.
The xlxs files are checked for the errors or missing values. The dataset included inconsistent ride_id column and blank cells in station_name column which are removed as they only make up a small percentage of the entire data. To remove the specific data filter option is used.
To make the analysis phase easier, transformation is done on the dataset. It included creation of new columns named start_date, start_time, end_date, end_time, ride_length and weekday as follows:
The dataset is cleaned, transformed and ready for analysis. Descriptive analysis is performed on the data using R programming in RStudio IDE.
First, a new project is created required library packages are installed and loaded into the console.
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\grace\AppData\Local\Temp\Rtmp6xvdms\downloaded_packages
install.packages("skimr")
## package 'skimr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\grace\AppData\Local\Temp\Rtmp6xvdms\downloaded_packages
install.packages("janitor")
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\grace\AppData\Local\Temp\Rtmp6xvdms\downloaded_packages
install.packages("lubridate")
## package 'lubridate' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'lubridate'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\grace\AppData\Local\R\win-library\4.3\00LOCK\lubridate\libs\x64\lubridate.dll
## to
## C:\Users\grace\AppData\Local\R\win-library\4.3\lubridate\libs\x64\lubridate.dll:
## Permission denied
## Warning: restored 'lubridate'
##
## The downloaded binary packages are in
## C:\Users\grace\AppData\Local\Temp\Rtmp6xvdms\downloaded_packages
install.packages("dplyr")
## package 'dplyr' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\grace\AppData\Local\R\win-library\4.3\00LOCK\dplyr\libs\x64\dplyr.dll
## to C:\Users\grace\AppData\Local\R\win-library\4.3\dplyr\libs\x64\dplyr.dll:
## Permission denied
## Warning: restored 'dplyr'
##
## The downloaded binary packages are in
## C:\Users\grace\AppData\Local\Temp\Rtmp6xvdms\downloaded_packages
library(tidyverse)
library(skimr)
library(janitor)
library(lubridate)
library(dplyr)
Now, current working directory is set to the file path of the local device where the datasets are stored. Then, the data of 12 months are read and combined to a single dataframe named “cyclistic_complete_data” for further analysis.
setwd("C:/Data Analysis/Google Data Analytics Course/Case studies/Case study 1/Cyclistic-datasets/3.Cleaned dataset final")
cyclistic_complete_data <- list.files(pattern = "*.csv",full.names = T) %>% lapply(read_csv) %>% bind_rows()
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
Check the merged dataframe “cyclistic_complete_data” for any inconsistencies.
colnames(cyclistic_complete_data) #List of column names
## [1] "ride_id" "rideable_type" "Start_Date"
## [4] "Start_Time" "End_Date" "End_Time"
## [7] "start_station_name" "start_station_id" "end_station_name"
## [10] "end_station_id" "start_lat" "start_lng"
## [13] "end_lat" "end_lng" "member_casual"
## [16] "ride_length"
head(cyclistic_complete_data) #See the first 6 rows of data frame.
tail(cyclistic_complete_data) #See the last rows of data frame.
str(cyclistic_complete_data) #Displays the structure of data frame.
## spc_tbl_ [4,331,646 Ă— 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:4331646] "5B6500E1E58655C0" "AA65D25D69AF771F" "079FB2C196414482" "599623864C871207" ...
## $ rideable_type : chr [1:4331646] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ Start_Date : chr [1:4331646] "10-04-2023" "12-04-2023" "13-04-2023" "29-04-2023" ...
## $ Start_Time : 'hms' num [1:4331646] 17:34:00 12:29:00 17:39:00 20:57:00 ...
## ..- attr(*, "units")= chr "secs"
## $ End_Date : chr [1:4331646] "10-04-2023" "12-04-2023" "13-04-2023" "29-04-2023" ...
## $ End_Time : 'hms' num [1:4331646] 18:02:00 12:54:00 17:40:00 20:57:00 ...
## ..- attr(*, "units")= chr "secs"
## $ start_station_name: chr [1:4331646] "Avenue O & 134th St" "Cottage Grove Ave & 51st St" "Morgan Ave & 14th Pl" "Cottage Grove Ave & 51st St" ...
## $ start_station_id : chr [1:4331646] "20214" "TA1309000067" "TA1306000002" "TA1309000067" ...
## $ end_station_name : chr [1:4331646] "Avenue O & 134th St" "Cottage Grove Ave & 51st St" "Morgan Ave & 14th Pl" "Cottage Grove Ave & 51st St" ...
## $ end_station_id : chr [1:4331646] "20214" "TA1309000067" "TA1306000002" "TA1309000067" ...
## $ start_lat : num [1:4331646] 41.7 41.8 41.9 41.8 41.9 ...
## $ start_lng : num [1:4331646] -87.5 -87.6 -87.7 -87.6 -87.7 ...
## $ end_lat : num [1:4331646] 41.7 41.8 41.9 41.8 41.9 ...
## $ end_lng : num [1:4331646] -87.5 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr [1:4331646] "member" "member" "member" "member" ...
## $ ride_length : 'hms' num [1:4331646] 00:28:00 00:25:00 00:01:00 00:00:00 ...
## ..- attr(*, "units")= chr "secs"
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. Start_Date = col_character(),
## .. Start_Time = col_time(format = ""),
## .. End_Date = col_character(),
## .. End_Time = col_time(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character(),
## .. ride_length = col_time(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
cyclistic_complete_data_sorted <- arrange(cyclistic_complete_data,Start_Date) #Sorts the data in ascending order based on Start_Date column
print(cyclistic_complete_data_sorted)
## # A tibble: 4,331,646 Ă— 16
## ride_id rideable_type Start_Date Start_Time End_Date End_Time
## <chr> <chr> <chr> <time> <chr> <time>
## 1 06EF8DD39CF7170F electric_bike 01-01-2023 18:04 01-01-2023 18:09
## 2 278C5B9B24F9676B electric_bike 01-01-2023 15:34 01-01-2023 15:43
## 3 C736F89E76D0E94C electric_bike 01-01-2023 15:18 01-01-2023 15:28
## 4 E45F1695932FFAD9 electric_bike 01-01-2023 21:27 01-01-2023 21:36
## 5 1CE75766A8D48796 electric_bike 01-01-2023 17:11 01-01-2023 17:13
## 6 42F33089E1A8B14F docked_bike 01-01-2023 12:00 01-01-2023 14:53
## 7 FD00061B604B8AAD electric_bike 01-01-2023 16:25 01-01-2023 16:44
## 8 5A9F13080F18C073 electric_bike 01-01-2023 16:21 01-01-2023 16:45
## 9 66E95481EAE4FAB7 classic_bike 01-01-2023 17:21 01-01-2023 17:28
## 10 0C6BD4EBC8C2459D classic_bike 01-01-2023 02:16 01-01-2023 02:23
## # ℹ 4,331,636 more rows
## # ℹ 10 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <time>
There are rows with ride_length “00:00” which are incorrect entries in the data set and cannot be considered for analysis. So, delete those rows using simple filter function.
cyclistic_complete_data_without_0 <- cyclistic_complete_data_sorted[cyclistic_complete_data_sorted$ride_length != "0", ]
cyclistic_complete_data_without_0
In order to make the data more reliable, data transformation can be done by extracting day,month,quarter from Start_Date column into different columns named Day, Month and Quarter.
cyclistic_complete_data_quarter <- cyclistic_complete_data_without_0 %>%
add_column(Quarter = quarter(cyclistic_complete_data_without_0$Start_Date,type = "quarter", fiscal_start = 1), Month = months(as.Date(cyclistic_complete_data_without_0$Start_Date)),
Day = day(as.Date(cyclistic_complete_data_without_0$Start_Date,"%d-%m-%y")))
head(cyclistic_complete_data_quarter[c("Start_Date","Quarter","Month","Day")])
cyclistic_final_data <- cyclistic_complete_data_quarter
cyclistic_final_data
When the above data is checked for inconsistency it is found that there are some records whose value is NA for the entire row.. So, final data set for the analysis is prepared by selecting only the rows with complete values. Also, to make the data more easier to understand, added a new column named Day_of_week with names of the day extracted from start_Date column.
cyclistic_final_data_v2 <- cyclistic_final_data[complete.cases(cyclistic_final_data),]
cyclistic_final_data_v2$Day_of_week <- weekdays(as.Date(cyclistic_final_data_v2$Start_Date))
cyclistic_final_data_v2
Now, lets get into the descriptive analysis…
Calculate the overall summary of ride length and group them on the basis of customer type.
average_ride_length <- mean(cyclistic_final_data_v2$ride_length) #average ride length (total ride length / rides)
max_ride_length <- max(cyclistic_final_data_v2$ride_length) #longest ride
min_ride_length <- min(cyclistic_final_data_v2$ride_length) #shortest ride
# summarize the above calculated values and group them according to customer type i.e, member/casual column.
cyclistic_summary_groupby_customer <- cyclistic_final_data_v2 %>%
group_by(member_casual) %>%
summarise(average_ride_length=mean(ride_length),
min_ride_length=min(ride_length),
max_ride_length=max(ride_length))
cyclistic_summary_groupby_customer
cyclistic_summary_groupby_customer_month <- cyclistic_final_data_v2 %>%
group_by(member_casual,Month) %>%
arrange(member_casual,Month) %>%
summarise(average_ride_length=mean(ride_length),
min_ride_length=min(ride_length),
max_ride_length=max(ride_length))
cyclistic_summary_groupby_customer_month
library(dplyr)
cyclistic_final_data_v2$Month <- ordered(cyclistic_final_data_v2$Month,
levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
cyclistic_final_data_v2 %>%
group_by(member_casual, Month) %>%
summarise(number_of_rides = n(), average_ride_length = mean(ride_length), min_ride_length=min(ride_length),
max_ride_length=max(ride_length), .groups="drop") %>%
arrange(member_casual, Month)
cyclistic_final_data_v2$Day_of_week <- ordered(cyclistic_final_data_v2$Day_of_week,
levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
cyclistic_final_data_v2 %>%
group_by(member_casual, Day_of_week) %>% #groups by member_casual
summarise(number_of_rides = n(), average_ride_length = mean(ride_length),.groups="drop") %>%
arrange(member_casual, Day_of_week)
cyclistic_final_data_v2 %>%
group_by(member_casual, rideable_type) %>% #groups by member_casual
summarise(number_of_rides = n(), average_ride_length = mean(ride_length),.groups="drop") %>%
arrange(member_casual, rideable_type)
cyclistic_final_data_v2 %>%
group_by(member_casual, rideable_type, Quarter) %>% #groups by member_casual
summarise(number_of_rides = n(), average_ride_length = mean(ride_length),.groups="drop") %>%
arrange(member_casual, rideable_type, Quarter)
cyclistic_final_data_v2 %>%
group_by(member_casual, Month, rideable_type) %>% #groups by member_casual
summarise(number_of_rides = n(), average_ride_length = mean(ride_length),.groups="drop") %>%
arrange(member_casual, Month, rideable_type)
Visualizations are useful to share the findings and key insights of the analysis to the stake holders. Tableau is used to create the visualizations and dashboard.
Initially, the cleaned dataset is loaded into Tableau in a new workbook. Then variety of charts are created.
The dashboard is given below:
!(C:AnalysisData Analytics Coursestudiesstudy 1dashboard img.png)
During the year 2023, a total of 4.24M rides are taken by the customers with an average ride length of 13 minutes.
Around 75% of rides(around 2.7M) were taken by annual members, but the average ride length(11 minutes) was less than that of casual riders(16 minutes).
There is an interesting pattern for the rides taken by both type of riders. The annual members tends to ride more on weekdays in contrast to the casual riders who rides on weekends more.
Total number of rides are high during the spring and summer seasons of the year with a peak in months July - August.
Classic bikes are mostly used throughout the year irrespective of the customer type which is followed by electric bikes and then docked bikes.
Docked bikes are only used by the casual riders but it is used for the long trips instead of classic or electric bikes. At the same time, the docked bikes are used throughout a week with an average ride length of 25 minutes which is longer than the trips with other two bikes.
Both riders use classic and electric bikes for their trips in a similar manner, but docked bikes are used in the first 3 quarters of the year.
Analyzing the above insights it is evident that annual member and casual riders are mainly differed in the purpose of their rides. Annual members uses the system for commute to work or studies with a steady average ride length and number of rides. On the other hand, casual riders uses bikes for long trips during weekends and they mostly uses docked bikes.
Annual members contribute largely to the ride but their average ride duration is less than that of casual riders. According to the analysis done, in order to convert the casual riders to annual members, new marketing strategy should be designed so that they can benefit more from the membership rides than the casual ones. For this, new travel plans, offers and discounts can be introduced for the annual members who uses docked bikes and long trips.Design more plans for specific group of customers such as students, senior citizens, etc. To reach out these plans to the public, more digital advertisements can be placed on public areas and bike stations.
More analysis should be done based on customer data such as age, travel destinations, etc to know more about their ride behavior. Surveys and feedback from the customers about their ride options and plans can be also taken and analysed to plan new marketing strategies.